Long non-coding RNA and circular RNA and coding RNA profiling of plasma exosomes of osteosarcoma by RNA seq

Osteosarcoma (OS) is a primary bone tumor with high malignancy and the mechanism of hematogenous metastasis in OS is still not clear. The plasma exosomes derived from osteosarcoma play a key role in the process of tumor metastasis. Here, we established RNA-seq dataset for lncRNAs, circRNAs and mRNAs in plasma exosomes from 10 OS patients and 5 healthy donors. A total of 329.52 Gb of clean data was obtained. Besides, 1754 lincRNAs, 7096 known and 1935 new circRNA was identified. Finally, gene expression profiles and differentially expressed genes (DEGs) were analyzed among these 15 samples. There were 331 DEGs of mRNA, 132 of lincRNA and 489 of circRNA was obtained, respectively. This data set provides a significant resource for relevant researchers to excavate potential dysregulated lncRNAs, circRNAs and mRNAs of plasma exosomes in OS versus normal conditions.


Background & Summary
Osteosarcoma (OS), which mainly occur in children and adolescents, is a primary bone tumor with high malignancy 1 . Despite great efforts have made in the past 20 years, it is still difficult to exceed the survival rate of OS to 60%. The main reason for this poor prognosis is that OS is prone to lung metastasis via hematogenous channel 2 . However, the mechanism of hematogenous metastasis in OS is still not clear.
Exosomes, a tiny extracellular vesicle with a diameter of 40-100 nm, can be released by various cells 3 . Exosomes contains many substances, such as long non-coding RNAs (lncRNAs), microRNAs (miRNA), circular RNAs (circRNAs) and protein. Thus, exosomes are regarded as the medium of intercellular communication, which are closely related to the many kinds of diseases 4 . It should be noted that tumor-derived exosomes can enter into blood and lead to tumor metastasis. Researches show that the plasma exosomes derived from osteosarcoma play a key role in the process of tumor metastasis 5 . However, the specific biological function of plasma exosomes to cancer metastasis remains unclear.
LncRNAs present linear chain structure with transcripts ≥200 nucleotides in length, while circRNAs is an enclosed circle structure without 3′-end and 5′-end. Although both lncRNAs and circRNAs belong to non-coding RNAs, they make a critical difference to regulation of mRNA 6 . LncRNAs, circRNAs and mRNAs are the main small molecular substance in exosomes, playing a crucial role in biological function of exosomes. A growing number of evidence show that exosomes secreted by primary tumor cells can carry these three types of molecules to the metastatic lesions and regulate the biological behavior of metastatic cells 7 . Thus, to make a profound study on these three types of RNA in plasma exosomes is essential to control cancer metastasis. RNA-seq is an effective and comprehensive method to explore RNA expression in exosomes. Data set of RNA seq of blood exosomes in various disease has been established 8 . However, transcriptome profiling of plasma exosomes in OS has not been systematically explored.
Subjects and sample collection. The experimental design and all workflow were presented in Fig. 1. To make the results more comparable, we developed elaborate inclusion and exclusion criteria for both patients and controls. Inclusion criteria: 1) patients were diagnosed with OS by postoperative pathological examination of resected specimens. 2) patients with tumor located in the extremities. 3) patients with initial treatment. Exclusion criteria: 1) patients and controls with infectious diseases, such as fever, hepatitis B infection, HIV, autoimmune disease and etc. 2) patients and controls with hemopathy, endocrine disease, and so on. 10 OS patients (6 female and 4 male) and 5 age-and sex-matched healthy controls were ultimately included in this study. According to surgical grading of bone tumor, 2 patients were classified into III, 5 patients into IIB, 2 patients into IB, and 1 patient into IIA, respectively (Table 1). 10 ml peripheral blood was collected from patients and controls in the morning with an empty stomach. Peripheral blood samples will be centrifuged at 300 min for 15 min to obtain blood plasma.
isolation of plasma exosomes. Firstly, plasma sample from individuals would be underwent a procedure of ultracentrifuged, which is similar to previous reports 9 . 7-fold phosphate-buffered saline (PBS) was added to the plasma sample to dilute the sample, centrifuged under condition of 13,000 × g for 30 minutes, and removed large particles via a 0.22 μm filter. The above supernatant would be centrifuged again for 2 hours at at 100,000 × g, 4 °C and then the exosomes pellet was obtained. The pellet was re-suspended with PBS and centrifuged again at 100,000 × g 4 °C for 2 h. Finally, 100 µl PBS were put into pellet to re-suspende the exosomes.
identification of exosomes. The isolated exosomes should be identified at three levels: nanoparticle tracking analysis (NTA), transmission electron microscopy (TEM), and western blot analysis (WB).
Nanoparticle tracking analysis (NTA). We used ZetaView PMX 110 (Particle Metrix, Meerbusch, Germany) equipped with 405 nm laser to determine the size and quantity of isolated particles. A video with a duration of 60 seconds was shot at a frame rate of 30 frames per second, and particle motion was analyzed using NTA software (ZetaView 8.02.28). Our result showed the mean diameter of exosomes was 105.3 nm and the main peak www.nature.com/scientificdata www.nature.com/scientificdata/ of diameter was located in 85.2 nm with a proportion 98.4%. Besides, the concentration was 1.5E + 12 particles/ mL ( Fig. 2a and Table 2).
Transmission electron microscopy (TEM). 10 µl exosomes sample of was placed on the copper mesh of electron microscope at room temperature for 1 min. After absorbing the floating liquid with filter paper, 15 µl of 2% uranyl acetate solution was took to stain for 1 min. The sample was then moved to an incandescent lamp and grilled  Quality validation of the raw data and library. Clean data (clean reads) were obtained by removing reads containing adapter, reads containing ploy-N and low-quality reads from raw data using Cutadapt software of version 2.10 (https://cutadapt.readthedocs.io/en/stable/). At the same time, Q20, Q30, GC-content and sequence duplication level of the clean data were calculated. The percentage of Q30 of each sample was not less than 88.74% in the present study, which released that the raw data in the present study was accredited (Table 3). After quality control of raw data, 329.52 Gb clean data was obtained.
In order to evaluate the quality of the library, we firstly estimate dispersion degree of estimated insert size. As showed in Fig. 3a   www.nature.com/scientificdata www.nature.com/scientificdata/ indicating a small degree of dispersion in the length of the inserted fragment, and the selection of the inserted fragment size is normal. Subsequently, randomness distribution map was plotted to understand the degradation of mRNA. As we can see in Fig. 3b, a 3′ bias was found. This is due to the discontinuous nature of RNA of exosomes, but not related to mRNA degradation. Finally, to evaluate whether the data is sufficient and meet subsequent analysis, saturation testing is performed on the gene numbers. The result demonstrated that the number of genes detected in each sample is saturated, indicating that the sequencing depth of this study is sufficient (Fig. 4).

Read alignment and transcript assembly.
We used the designated genome GRCh38 (ftp://ftp.ensembl. org/pub/release-101/fasta/homo_sapiens/) as a reference for sequence alignment and subsequent analysis. For lncRNAs and mRNAs, we used the HISAT2 software (version 2.2.1.0) 10 to align the reads and StringTie software (version 2.1.3) 11 to assemble the aligned reads. Considering the feature of circRNA sequencing data, we used Burrows-Wheeler-Alignment Tool (https://sourceforge.net/ projects/bio-bwa/files/) to complete sequence alignment. The read mapping results of lncRNA, mRNA and circRNA were summarized in Table 3.
lncRNA gene prediction and identification. Cuffcompare program from the Cufflinks package v2.2.1 was firstly applied to compare the assembled transcripts with the existing gene annotation. We compared the results of assembled transcripts with the known lncRNA, and removed the known transcripts (class_code = c). The unknown transcripts were used to screen for putative lncRNAs. Four computational software for coding capability prediction, including CPC 12 , CNCI 13 , Pfam 14 and CPAT 15 , were combined to sort non-protein coding RNA candidates from putative protein-coding RNAs in the unknown transcripts. The candidate lncRNA should meet the following conditions: (1)Transcription length ≥ 200 nt, (2) having more than two exons, (3) FPKM ≥ 0.1, 4) and class_code were "i","x","u","o","e". Menwhile, further screened using above four computational approaches that have the power to distinguish the protein-coding genes from the non-coding genes. As well as the different types of lncRNAs include lincRNA, intronic lncRNA, anti-sense lncRNA were selected using cuff compare. The intersection of prediction results of several tools was taken as the prediction result of new lncRNAs. We ultimately obtained 1754 lincRNAs (Fig. 5). Supplementary Table 1 showed the prediction of the names of the lincRNA.
CircRNA prediction and identification. Two algorithms of CIRI2 and find_circ were used to predict circRNA. In CIRI2 16 , by utilizing BWA software (Version: 7.12-r1039), the clean data was aligned to the reference gene set GRCh38 and the SAM files could be obtained. These SAM files were analyzed by CIRI2 software v2.0.5 (https://sourceforge.net/projects/ciri/files/CIRI2/) and ultimately gained the result of RNA prediction. Find_circ software (https://github.com/marvin-jens/find_circ/archive /v1.2.tar.gz) used bowtie2 (version 2.2.3) to complete the procedure of aligning and obtained predictive circRNA. The intersection gene obtained by the two algorithms will be used as the final circRNA. The predictive circRNA would be aligned to circBase database (http://www.circbase.org/) to identify known and new circRNA. We ultimately dentified 7096 known and 1935 new circRNA. Supplementary Tables 2, 3 17 . The screening criteria were |log2FC| ≥ 2 and pvalue < 0.05. Finally, there were 331 DEGs of mRNA ( Fig. 6a and Supplementary   Fig. 4 The saturation testing of the library. The result demonstrated that the number of genes detected in each sample is saturated, indicating that the sequencing depth of this study is sufficient. www.nature.com/scientificdata www.nature.com/scientificdata/  Table 5) and 489 of circRNA (Fig. 6c and Supplementary  Table 6), respectively. Besides, the DEGs and the name of top 20 DEGs of mRNA, lincRNA and circRNA were also showed in the volcano plot ( Fig. 6d-f).

Data Records
The FASTQ files for the raw data and BAM files have been deposited in NCBI Sequence Read Archive (SRA) under BioProject accession of PRJNA904243 and SRP409185 18 . The transcript abundance was deposited in the NCBI Gene Expression Omnibus (GEO) and the accession number was GSE218526 19 . The GEO accession number of each sample was listed in Table 1. The files of RNA prediction, expression abundance quantification and DEGs analysis was also deposited in figshare 20 .

Technical Validation
RNA integrity assessment. NanoDrop spectrophotometer was utilize to quickly determine the OD value of RNA samples at 260 and 280 nm. The ratio of A260/A280 range 1.8-2.0 was considered acceptable, otherwise it would be resampled. Subsequently, Agilent 2100 Bioanalyzer was used to evaluate RNA integrity.
Quality validation of RNA data. In order to obtain clean data, Cutadapt software of version 2.10 (https:// cutadapt.readthedocs.io/en/stable/) was first used to remove reads containing adapter. At the same time, Q20, Q30, GC-content and sequence duplication level of the clean data were calculated. Low quality read that met the following conditions would be excluded: 1) The reads in which the number of undetermined base pairs was greater than 10%; 2) Q30 < 88.74% . www.nature.com/scientificdata www.nature.com/scientificdata/

Code availability
Most steps were completed base on the public-domain software, except for the calculation of differential genes. All analytical code of DEGs is available on the GitHub repository (https://github.com/tangaode/Plasma-exosomes). The provided R code was run and tested using R 4.1.0. The name and the links of all database depositories was showed in Table 4.